Finding Ideographic Representations of Japanese Names Written in Latin Script via Language Identification and Corpus Validation
Abstract
Multilingual applications frequently have to handle proper names, but names are often missing from bilingual lexicons. The problem is exacerbated in applications that translate between Latin-scripted languages and Asian languages such as Chinese, Japanese and Korean (CJK), where simple string copying is not a solution. We present a novel approach for generating the ideographic representations of a CJK name written in a Latin script. The approach first identifies the origin of the name, and then back-transliterates the name to all possible Chinese characters using language-specific mappings. To keep the massive number of candidates computationally tractable, we apply a three-tier filtering process: first through a set of attested bigrams, then through a set of attested terms, and lastly through the WWW for a final validation. We illustrate the approach with English-to-Japanese back-transliteration. Against test sets of Japanese given names and surnames, we have achieved average precisions of 73% and 90%, respectively.
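The candidate generation and the first two filtering tiers described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reading table, the attested bigram and term sets, and every function name below are hypothetical toy data chosen for illustration. The language identification step is omitted, and the final web validation is reduced to a caller-supplied placeholder function.

```python
from itertools import product

# Hypothetical toy mapping from romanized reading units to candidate characters.
# A real system would use a large language-specific mapping; this is an assumption.
READINGS = {
    "yama": ["山"],
    "kawa": ["川", "河", "皮"],
    "da": ["田", "打"],
}

# Hypothetical stand-ins for the attested bigram and attested term resources.
ATTESTED_BIGRAMS = {"山田", "川田", "河田", "皮田"}
ATTESTED_TERMS = {"山田", "川田", "河田"}


def segmentations(name):
    """Yield every way to split `name` into keys of READINGS."""
    if not name:
        yield []
        return
    for end in range(1, len(name) + 1):
        head = name[:end]
        if head in READINGS:
            for rest in segmentations(name[end:]):
                yield [head] + rest


def candidates(name):
    """Yield every character sequence licensed by some segmentation."""
    for units in segmentations(name):
        for chars in product(*(READINGS[u] for u in units)):
            yield "".join(chars)


def bigram_ok(cand):
    """Tier 1: every adjacent character pair must be an attested bigram."""
    return all(cand[i:i + 2] in ATTESTED_BIGRAMS for i in range(len(cand) - 1))


def back_transliterate(name, web_check=lambda cand: True):
    """Apply the tiers in order; `web_check` is a placeholder for tier 3."""
    tier1 = [c for c in candidates(name) if bigram_ok(c)]
    tier2 = [c for c in tier1 if c in ATTESTED_TERMS]  # Tier 2: attested terms
    return [c for c in tier2 if web_check(c)]


if __name__ == "__main__":
    print(back_transliterate("kawada"))
    print(back_transliterate("yamada"))
```

Ordering the filters from cheapest to most expensive means the costly final check only sees candidates that survived the two set lookups, which is the motivation the abstract gives for tiering.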
Similar resources
Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment
Bilingual lexicons of proper names play a vital role in machine translation and cross-language information retrieval. Word alignment approaches are generally used to construct bilingual lexicons automatically from parallel corpora. Aligning proper names is a particularly difficult task when the source and target languages of the parallel corpus do not share the same script. We present in ...
Study of the impact of proper name transliteration on the performance of word alignment in French-Arabic parallel corpora (Etude de l'impact de la translittération de noms propres sur la qualité de l'alignement de mots à partir de corpus parallèles français-arabe) [in French]
Bilingual lexicons play a vital role in cross-language information retrieval and machine translation. The manual construction of these lexicons is often costly and time-consuming. Word alignment techniques are generally used to construct bilingual lexicons from parallel texts. Aligning single words and nominal syntagms from parallel texts is a relatively well-controlled task for languages using...
Translating Chinese Romanized Name into Chinese Idiographic Characters via Corpus and Web Validation
Cross-language information retrieval performance depends on the quality of the translation resources used to pass from a user’s source language query to target language documents. Translation lists of proper names are rare but vital resources for cross-language retrieval between languages using different character sets. Named entities translation dictionaries can be extracted from bilingual cor...
The role of interword spacing in reading Japanese: An eye movement study
The present study investigated the role of interword spacing in a naturally unspaced language, Japanese. Eye movements of native Japanese readers were recorded as they read pure Hiragana (syllabic) and mixed Kanji-Hiragana (ideographic and syllabic) text in spaced and unspaced conditions. Interword spacing facilitated both word identification and eye guidance when reading syllabic script, but not ...
Dependency Analysis of Japanese Spoken Language via SVM
This paper discusses a dependency analyzer for Japanese spoken language that employs Support Vector Machines (SVMs). Most conventional dependency analyzers target written texts. We therefore use a currently available spoken language corpus and train the SVMs on it to build a dependency analyzer that targets spoken language. We used two types of corpora: one contains written language, and the o...